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Abstract 

We present an out-of-core hydrodynamic code for high resolution cosmological sim- 
ulations that require terabytes of memory. Out-of-core computation refers to the 
technique of using disk space as virtual memory and transferring data in and out of 
main memory at high I/O bandwidth. The code is based on a two-level mesh scheme 
where short-range physics is solved on a high-resolution, localized mesh while long- 
range physics is captured on a lower resolution, global mesh. The two-level mesh 
gravity solver allows FFTs to operate on data stored entirely in memory, which is 
much faster than the alternative of computing the transforms out-of-core through 
non-sequential disk accesses. We also describe an out-of-core initial conditions gen- 
erator that is used to prepare large data sets for cosmological simulations. The 
out-of-core code is accurate, cost-effective, and memory-efficient and the current 
version is implemented to run in parallel on shared- memory machines. I/O over- 
head is significantly reduced down to less than 10% by performing disk operations 
concurrently with numerical calculations. The current computational setup, which 
includes a 32 processor Alpha server and a 3 TB striped SCSI disk array, allows us 
to run cosmological simulations with up to 4000^ grid cells and 2000^ dark matter 
particles. 
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1 Introduction 



Presently, one of the big tasks in cosmology is to determine the concordance model moti- 
vated by the results from the cosmic microwave background (CMB), large-scale structure 
(LSS), and supernovae. The precision measurement of the matter power spectrum is one 
key scientific goal and observations of the Lyman alpha (Lya) forest, the Sunyaev-Zeldovich 
(SZ) effect, and weak lensing of the LSS are expected to complement existing constraints. In 
order to do cosmology with these probes, numerical modelling of the nonlinear physics must 
achieve a level of accuracy on par with upcoming precision observations. 

Cosmological hydrodynamic and N-body simulations are standard tools for modeling non- 
linear structure formation in the universe. Numerical simulations must converge over a large 
range in scale, mass, and temperature. Large box sizes are required to capture large-scale 
correlations and power while high resolution is needed to resolve small-scale, nonlinear struc- 
tures. Most simulations to date have sacrificed one for the other because of the limitations 
in computational resources. While parallelized numerical codes and optimized mathematical 
libraries have significantly reduced computation time on high performance computing (HPC) 
systems, memory limitations remain the bottleneck in expanding the size of simulations. 



We describe the implementation of an out-of-core hydrodynamic (OCH) cosmological code 
for high resolution simulations requiring terabytes of memory. Out-of-core computation refers 
to the technique of using disk space as virtual memory and transferring data in and out of 
main memory at high I/O bandwidth. Conventional memory is an expensive commodity and 
currently most HFC systems have a few hundred gigabytes at most. However, disk space is 
relatively cheap and disk arrays can provide many more orders of magnitude in storage. 
In order to be effective, out-of-core computation must avoid being I/O limited. Striped 
SCSI disk arrays are capable of delivering on the order of 100 — 1000 MB/s throughput 
and have sufficient bandwidth. However, the slow latency presents a major problem for 
codes requiri ng non-sequential d i sk acc ess. Pioneering work has been done for computational 
astrophysics. ISalmon &: WarrenI (|l997l ) developed a parallel, out-of-core tree N-body code to 
run an 80 million particle simulation on a distributed-memory Pentium cluster with a total of 
16 processors, 2 GB of memory, and 16 GB of disk space. The out-of-core paradigm remains 
a niche that has been relatively unexplored for numerical simulations. We demonstrate that 
advances in algorithmic development now make it feasible to do out-of-core computation 
effectively. 



The OCH code is designed for cosmological applications where high mass reso lution is re- 
quired at all scales. The code is based on the Hydro&N-body implementation of iTrac &: Pen 
(12004). The hydrodynamics of the baryoni c gas is captured using an Eulerian total variation 
diminishing (TVD) scheme (Harten, 1983? ) that provides high resolution capturing of shocks 
while preventing unphysical instabilities. The gravitatio nal evolution of the dark ma tter is 
tracked using the standard particle- mesh (PM) scheme ( Hocknev fc Eastwoodl . 1988[ l. 



The challenges in doing out-of-core computation stem from the requirement that the data 
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domain be decomposed into blocks which can fit in memory. The hydrodynamics of the ideal 
gas is a local process that is straightforward to solve on the decomposed domain with the 
addition of buffer regions. However, gravity is a global force and Poisson solvers operate on 
the global density field. Fast Fourier transforms (FFTs) are the optimal solvers in a standard 
PM scheme, but they involve global transposes and computing the transforms out-of-core 
will be intolerably slow. This problem can be addressed by splitting the gravitational force 
into long and short-rang e components, similar to that done in P^ M ( Hocknev &: Ea stwood!. 
1988t) mesh - refined PM (ICouchmanl .ll99l': 'C ouchman et al.l . [19951) , and Tree-PM (|Xu . .199,4 



BadaL 120021: ISode fc Ostrikeii . boost Dubinski et al.l . l2004l ) codes. We implement a two-level 



mesh scheme where the short-range force is solved on a high-resolution, localized mesh while 
the long-range force is captured on a lower resolution, global mesh. The two-level mesh 
gravity solver is memory-efficient and allows FFTs to be performed on data stored entirely 
in memory. 



In this paper, we describe the out-of-core algorithm and highlight the steps required for doing 
out-of-core computation. In particular, we discuss optimizations to reduce I/O overhead 
by performing disk operations and numerical calculations concurrently. The two-level mesh 
gravity solver is described and its accuracy is compared to that of a standard one-level mesh 
solver. In addition, we present an out-of-core initial conditions generator that can be used 
to construct large data sets for high resolution cosmological simulations. 



2 Out-of-Core Algorithm 



The three-dime nsional out-of-core hydro (OCH) code is based on the Hydro&N-body im- 
plementation of iTrac fc Pen! (j200J). For reference, a pedago gical primer on Eulerian compu- 
tational fluid dynamics is presented in ()Trac fc Pen . 20031 ). We briefly describe the cosmo- 
logical code and present modifications specifically added for doing out-of-core computation. 
The code is designed for cosmological applications in an Friedman- Robertson- Walker (FRW) 
universe. 



The out-of-core implementation requires both the division of data into blocks that can fit in 
memory and the decomposition of physical laws with global dependence into short and long 
range components. The code uses a two-level mesh algorithm where short-range physics is 
solved on a high-resolution, localized mesh while long-range physics is captured on a lower 
resolution global mesh. The global mesh is 4 times coarser in each dimension, reducing 
memory usage by a factor of 64 when calculating the long-range interactions. The standard 
technique of having buffer regions around the local blocks ensures that the short-range in- 
teractions are calculated with no boundary effects imposed by the domain decomposition. 
In the sections to follow, we describe the Hydro&N-body implementation and the two-level 
mesh gravity solver. 
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2.1 Hydro&N-body 



The hydrodynamics of the baryonic ga s is so lved using an Eulerian algorithm based on the 
moving frame approach of Trac fc PenI (12004 ) and the second-order accurate total variation 
diminishing (TVD) scheme ( Harten . Il983[ l. The Eulerian conservation equations are solved 
in an adaptive frame moving with the fluid where local fluid variables can be directly tracked. 
The moving frame approach minimizes the numerical Mach numbers, allowing high resolution 
capturing of shocks and preventing spurious oscillations in both subsonic and supersonic 
regions. Time integration is performed using a second-order accurate Runge -Kutta techniq ue. 
The moving frame hydro algorithm uses an operator-splitting technique ( Strang . 1968[ ) to 
efficiently solve the various flux terms in the Euler equations. To get second-order accuracy, 
each hydro iteration is composed of a double time step and the order of operations is reversed 
in the second time step. A single gravity step is done between the forward and reverse hydro 
steps. 



The gravitati onal evolution of the dark ma tter particles is tracked using the stan dard PM N- 
body s cheme ( Hocknev fc Eastwoodl . IQSSi l. We build upon the implementation of Trac fc Pen 
fl2004^ . which is based on a single- mesh gravity solver. In the two-level mesh scheme, mass 
assignment and force interpolation on both the fine and coarse grids are accomplished using 
'cloud-in-ceir (CIC) interpolation. Cubical cloud shapes resembling the fine or coarse cells 
are used for the respective grids. The coUisionless particles are advected by solving Newton's 
equations of motion in an FRW universe. Time integration is performed using an explicit, 
second-order accurate Runge-Kutta technique that allows synchronization between the dark 
matter particles and the gas. 



2.2 Gravity 



2.2.1 Two-level mesh gravity solver 

Gravity is a global force that can be decomposed into short and long range components and 
solve with a two-level mesh scheme. Poisson's equation 

= 4vrGp, (1) 

relates the gravitational potential (p to the density field p and the general solution can be 
written as 

(j){x) = J p{x')w{x - x')£x\ (2) 
where the isotropic kernel is given by 

G 

w{r) = . (3) 

r 

In the standard mesh approach, the convolution is done in Fourier space and fast Fourier 
transforms (FFTs) are used to convert the discrete density field and recover the discrete 
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potential field. Since FFTs are highly nonlocal and involve global transposes, computing the 
transforms out-of-core using non- sequential disk accesses will be intolerably slow. However, 
the two-level mesh scheme allows us to apply the FFTs to data that are entirely stored in 
memory. 

We decompose the potential kernel w{r) into a short-range component 

(4) 

I u otherwise, 
and a long-range term 




J a(r) if r < r^, 

Mr) = \ . . ^, . (5) 

lw(r) otherwise, 

where the short-range cutoff is a free parameter. The function a{r) is chosen to be a 
polynomial, 

a{r) = G{a + br'^ + cr^), (6) 

whose coefficients, 

15 
10 

^ = i;:3' (7) 



c 
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are determined from the conditions 



a(rc) = w{rc), 

a'{r,) =w'{r,), (8) 
a"{rc) = w"{rc). 

These restrictions ensure that the long-range kernel smoothly turns over near the cutoff and 
that the short-range term smoothly goes to zero at the cutoff. 

The long-range potential (f)'({x) is computed by performing the convolution over the coarse- 
grained global density field p'^{x). The superscript c denotes that the discrete fields are 
constructed on a coarse grid. The total density field has contributions from both the dark 
matter particles and baryonic gas cells. Mass assignr nent onto the coarse grid is ac complished 
using the ' cloud- in-cell' (CIC) interpolation scheme ( Hocknev k. Eastwood . 1988[ ) with cloud 



shape being the same as a coarse cell. The long-range force field fi{x) is obtained by finite 
differencing the long-range potential and force interpolation is carried out using the same 
CIC scheme to ensure no fictitious self-force. 

Since the two-level mesh scheme uses grids at different resolutions, the decomposition given 
by equations (4) and (5) needs to be modified. In Fourier space, we can write the long-range 
potential as 

Uk) = ~p'{k)w'i{k) = [p{k)~sp{k)][wi{k)~s^{k)l (9) 
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(a) (b) 

Fig. 1. The decomposition of the potential kernel (a) and radial force kernel (b) for the two-level 
mesh gravity solver. The long-range (filled squares), uncorrected short-range (small dots), and 
corrected short-range (filled circles) components are plotted. For comparison, the 1/r isotropic 
potential kernel and the isotropic force kernel are illustrated by the dashed lines. Here, the 
uncorrected short-range cutoff is rg = 16 fine grid cells and the corrected cutoff is approximately 
20 grid cells. 

where Sp{k) and Su,{k) are the Fourier transforms of the mass smoothing window Sp{x) 
and kernel sampling window s^ix), respectively. The mass smoothing window takes into 
account the CIC mass assignment scheme for constructing the coarse density field. The 
kernel sampling window corrects for the fact that the long-range kernel given by equation 
(5) is sampled on a coarse grid. In Fourier space, the corrected short-range potential kernel 
is now given by 

Ws{k) = w{k) - wi{k)sp{k)swik), (10) 
and can be slightly anisotropic, particularly near the short-range cutoff. 

Plotted in Figure 1(a) arc the various components of the potential kernel for an initial short- 
range cutoff Tc = 16 fine grid cells. For comparison, the isotropic kernel w{r) is illustrated 
by the dashed hne. The long range kernel Wi{x) is constructed on a coarse grid and the 
values in each cell are plotted with filled squares. The uncorrected and corrected short-range 
kernels w^{x) are constructed on a fine grid and plotted with small dots and filled circles, 
respectively. The corrected short-range kernel smoothly goes to zero at a radius now slightly 
larger than Tc- The new gravity cutoff bg = 20 fine grid cells sets the size of the buffer for 
the local blocks. 

For each local block in the domain decomposition, the short-range potential 0{(£c) is com- 
puted by performing the convolution over the high-resolution density field p^{x). The su- 
perscript / denotes that the local fields are constructed on a fine grid. The short-range force 
field fs{x) is then obtained by finite differencing the short-range potential. 
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Note that as an alternative, the force can be calculated directed using the convolution, 

f,ix) = J pix')w,ix - x')(fx\ (11) 

where the anisotropic force kernels are given by 

w,{x) = -G^. (12) 

The force kernels can be decomposed into short and long range components in a similar 
fashion to that described previously for the potential kernel. In Figure 1(b), the magnitude 
of the force kernels are plotted for a short-range cutoff = 16 fine grid cells. 

In the potential method, the finite differencing degrades the pair-wise force resolution by 
a few grid cells, but the net force on any given cell or particle is still highly accurate in 
general. The force method exactly reproduces the pair-wise inverse-square law on the grid, 
but not for particles since the CIC mass assignment scheme smoothes the mass and pair- 
wise force near the grid scale. Both the potential and force methods require one FFT to 
forward transform the density field, but the potential method only requires one inverse 
transform for the potential field while the force method requires three inverse transforms 
for the force components. Hence, the force method comes at the cost of two extra FFTs. 
For the high-resolution short-range force field, the relatively small accuracy trade-off of the 
potential method is preferred over the factor of 2 increase in computational work of the 
force method. For the lower-resolution long-range force field, the force method allows us to 
avoid finite differencing the coarse grid and further degrading the resolution of an already 
smoothed field. We can afford the extra factor of 2 increase in cost since the long-range force 
calculation is already a factor of 64 cheaper than the collective short-range force calculations. 



In summary, we combine the potential and force methods to generate a two-level mesh 
gravity solver that is accurate, cost-effective, and memory-efficient. The short-range force is 
computed using the potential method while the long-range force is solved directly using the 
force method. We choose an initial short-range cutoff = 16 fine grid cells and a gravity 
cutoff bfj — 20 fine grid cells. A version of this two- level mesh gravity solver is used by 



Merz. Pen, fc Trad (j200^. 



2.2.2 Pair-wise force test 

The accuracy of the two-level mesh gravity solver can be checked by performing a pair-wise 
force test. We place a particle randomly in a box, calculate both the short and long range 
force fields, and measure the pair-wise force for a number of randomly placed test particles. 
We choose 128 random centers and for each center, 10^ test particles are randomly distributed 
with equal numbers per radial bin to quantify the errors. In Figure (2) we plot the fractional 
error 

Sf ^ ^^y(^, (13) 
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Fig. 2. Fractional errors for pair- wise force test. The top and middle panels show the errors for the 
uncorrected and corrected two-level mesh gravity solver with = 16 grid cells, while the bottom 
panel shows the errors for a standard single-mesh solver. With the corrected kernel, the mean error 
for separations in the range r = it 4 is zero, the rms error is < 2%, and the maximum error is 
< 5%. 



where /(r) = G/r"^ is the magnitude of the exact force at separation r. The top and middle 
panels show the fractional errors for the uncorrected and corrected two-level mesh gravity 
solver while the bottom panel shows the errors for a standard one-level mesh gravity solver 
based on the potential method. The same random subset of pairs, with equal numbers per 
logarithmic bin, are plotted for all three cases for proper comparison. 



In all three cases, the fractional errors can approach unity near the grid scale and the source 
of the problem is the CIC mass assignment scheme. The cubical cloud shape softens the 
mass reso lution and introduces force a nisotropies. In principle, higher-order mass assignment 
schemes ( Hocknev fc Eastwood . 1988| ) like trian gular-shaped clouds (TS C) can reduce the 
force fluctuations near the grid scale. However. lEfstTthiou et all (|l985h have conducted a 
detailed study of the PM N-body method and have concluded that the CIC scheme offers 
the best balance between computational cost and force accuracy. 
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Near the initial cutoff Vc = 16 grid cells, the magnitude of the force is under predicted by the 
uncorrected gravity solver. The mass smoothing and kernel sampling on the coarse grid are 
responsible for the decrease in power. After correcting for these effects, the mean error for 
separations in the range r = Tc ± 4 is zero, the rms error is less than 2%, and the maximum 
error is less than 5%. 



3 Out-of-core Computation 

At the Canadian Institute for Theoretical Astrophysics (CITA) we have a HP/Compaq 
GS320 Alpha server with 32 processors and 64 GB of shared memory. The total floating- 
point speed has a peak value of 48 Gflops and the total memory bandwidth is 14 GB/s. The 
server has a total of 8 independent PCI busses, each running at 266 MB/s. In the current 
configuration, an array of 84 SCSI disks providing 3 TB of storage is connected to the server 
through 6 high bandwidth dual Ultra3 SCSI controllers, each with a peak bandwidth of 320 
MB/s. To improve I/O performance, we stripe the disk array using the HP/Compaq Logical 
Storage Manager software package. We can achieve read and write speeds of up to 500 MB/s 
when direct I/O is used. 

Out-of-core computation requires dividing the global data domain into smaller local blocks 
that can fit entirely in memory. We work on each block sequentially: a data block is read 
from disks into memory, processed in parallel for a number of time steps, and then written 
back to the disk array. In the following sections, we describe the domain decomposition, time 
stepping, and optimizations for doing out-of-core computation. 

3.1 Domain Decomposition 

The three-dimensional global physical domain is decomposed into cubical regions and each 
local region is extended by buffer regions from its 26 neighbours. Associated with each 
extended local region is a high resolution hydro array and a dark matter particle list. The 
collisionless dark matter particles only interact gravitationally and for each gravity step, 
a buffer length oi bg = 20 grid cells is required to calculate the short-range force without 
introducing artificial boundary effects. The moving frame hydro algorithm requires 6 buffer 
cells per single time step and a hydro buffer bh = 12 cells for a double time step. A total 
buffer size of bt = 32 grid cells is required for every double time step. Only one gravity step 
is done per double hydro step. 

The periodic global domain is represented solely by a total density field, constructed on a 
lower-resolution mesh. The global density field is used to calculate the long-range force field 
and it can also be used to provide another level of buffering for solving the short-range force 
field. To do multiple time steps while the data is in memory, one can increase the size of 
the buffer, but this is inefficient. A data block contains both the local physical region plus 
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call long_range_force 

do c=l,Nc 
do b=l,Nb 



do a= 


l,Na 


pa 1 1 


1 CO \J _U CI LCl 


call 


long_range_acceleration 


call 


hydroJorward 


call 


pm_gravity 


call 


hydro_reverse 


call 


hydroJorward 


call 


pm_gravity 


call 


hydro_reverse 


call 


write_data 


enddo 





enddo 
enddo 

Fig. 3. A schematic Fortran code illustrating the time stepping in the OCH cosmological code. We 
loop over the decomposed domain and work on each local block sequentially: a data block is read 
from disks into memory, processed in parallel for a number of time steps, and then written back to 
the disk array. 

buffers and the maximum allowable size is limited by the amount of memory in the system. 
If the buffer size is increased, then the size of the local region must decrease, resulting in 
extra work overhead when trying to solve a fixed-size problem. We can do an additional 
short-range gravity step by using the global density field to construct an additional buffer 
for the high-resolution local density field. 

3.2 Time Stepping 

A schematic Fortran code is presented in Figure 3 to illustrate the time stepping in the code. 
At the beginning of each stepping iteration, we read the global density field from the disks 
and compute the long-range force field. The global data is stored on a lower-resolution mesh 
and fits entirely in memory. The global long-range force field is divided into local blocks and 
written back to the disks. This set of tasks is performed by the subroutine long_range_force. 
The global physical domain is decomposed into Na x Nb x Nc number of local regions and 
we work on each sequentially. The subroutine read_data creates an extended local block by 
reading local and buffer data from files on disks. We then update the velocities of the gas 
and dark matter particles using the long-range force field. This is done with the subroutine 
long_range_acceleration. 

A forward hydro sweep, gravity step, and reverse hydro sweep is then performed with the 
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subroTitincs hydro -forward, pm_gravity, and hydro_reverse, respectively. In the gravity step, we 
construct a high resolution total density field from the synchronized gas and dark matter. 
The extended local block already contains a buffer of length bf. In addition, the density field 
has another level of buffering drawn from the global density field. The gas velocities and the 
dark matter particle velocities and positions are advanced by two timesteps in the gravity 
step. In our multi time step scheme, the short-range tasks are repeated again. 

The data for the next stepping iteration is prepared in the subroutine write_data. A new local 
block and new buffer blocks are written to individual files. In addition, the gas and dark 
matter arc used to construct a portion of the lower resolution global density field to be used 
by the subroutine long_range_force in the next stepping iteration. 



In total for each stepping iteration, the gas and dark matter are advanced by 4 time steps 
while the data is in memory. To be second-order accurate, the value of the time step is fixed 
during a double step and also between the last double step and the first double step of the 
next stepping iteration. The first criterion is imposed by the hydro algorithm and the short- 
range acceleration while the the latter criterion is required by the long-range acceleration. 
The value of the time steps between consecutive double steps in each stepping iteration can 
be different. 



3.3 Optimizations 



The OCH code is optimized in several ways. In order to be feasible, out-of-corc computation 
must avoid being I/O limited. We can significantly reduce I/O overhead by performing disk 
operations concurrently with numerical calculations. The first local block of data is read into 
memory and while computation is being done on it, we simultaneously read in the second 
local block. While doing computation on the second local block, we write out the first block 
and read in the third block. One thread in our multi-processor machine is assigned to doing 
I/O. With this scheme, we have managed to effectively hide the disk operations and reduce 
I/O overhead to less than 10%. 

We have also significantly reduced the amount of disk operations with the two-level mesh 
gravity solver and the multi-stepping scheme. The two-level mesh gravity solver allows FFTs 
to operate on data stored entirely in memory, which is much faster than the alternative of 
computing the transforms out-of-corc through numerous non-sequential disk accesses. The 
multi-stepping scheme allows us to advance the gas and dark matter by 4 time steps before 
writing the data back to disks. 
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4 Cosmological Initial Conditions 



Cosmological initial con ditions for large simulations are also generated out-of-core using a 



two-level mesh scheme (jPenl . 119971 ). In the standard method, the overdensity field S{x) is 



constructed from the convolution, 

6{x) = I n{x')w{x - a;')rfV, (14) 



of a random white noise field n{x) with a density kernel w{x), whose Fourier transform is 
the square root of the initial matter power spectrum Pi{k). In the two-level mesh scheme, the 
density kernel is decomposed into short-range (Eq. [4]) and long-range (Eq. [5]) components. 
The density cutoff is denoted to differentiate it from the gravity cutoff Vg. In this case, 
the function a{r) is chosen to be 

a{r) = (1 -|- ar^ -|- -|- cr^)w{r), (15) 

with coefficients, 



a 2 5 



3 



b=^^ (16) 



^5 



^ 6 ' 



d 

that satisfy the conditions given by equation (8). This parametrization is more generic in 
that the coefficients are independent of the global density kernel w{r) and its derivatives. 

After the matter density field has been determined, the gravitational force field is calculated 
and used to construct the comoving displacement field 

and the proper peculiar velocity field 

x) = a—— = a— ax, (18) 
' dt D ' ^ ' 

where p is the comoving mean density, D{t) is the linear growth factor, and D is its time 
derivative. The dark matter particles are displaced from a uniform distribution and their 
velocities are determined by interpolating from the grid. The gas distribution is taken to 
trace the matter distribution and its initial conditions are readily obtained from the grid- 
constructed fields. The Poisson solver in the two-level mesh cosmological initial conditions 
generator is based on the force method, which can calculate the gravitational force field 
exactly for the grid-constructed matter density field. 
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(a) (b) 

Fig. 4. The decomposition of the density function (sohd hnes) for the two-level mesh initial con- 
ditions generator. The uncorrected short-range (dotted), corrected short-range (short dash), and 
long-range (long dash) components are plotted. The real space functions with a short-range cutoff 
Vg = 30 fine grid cells are shown in (a) and the Fourier space versions are in (b). 



The two-level cosmological initial conditions generator is tested with the following exercise. 
Consider a periodic simulation box of comoving length L = 100h~^ Mpc that is discretized 
by a 512^ grid. The global domain is divided into 2x2x2 number of cubical local regions. 
Each local block is extended by buffers and has a length of 256 + 2bt grid cells. The density 
cutoff and gravity cutoffs are chosen to be = 30 and Vg = 28 grid cells, respectively, and 
a total buffer size bt = 32 grid cells is needed after correcting the short-range kernels using 
equation(lO). The gas is discretized on the 512^ grid and the dark matter is represented by 
256^ particles. 



In Fourier space, the isotropic density function w{k) is taken to be the square root of the 
initial matter power spectrum Pi(k ). The matter transfer function is computed using CMB- 
FAST ISeliak fc ZaldarriagaL Il99fi[ ) and with WMAP parameters: !]^ = 0.27, Oa = 0.73, 
rib = 0.044, n = 1, as = 0.84, and ho = 0.7 (|SDergel et all l2003[ ). The initial conditions 
are generated for an initial redshift 2; = 50 where the matter power spectrum is still linear 
and the real space density field has |5max| < 1- The density function w{k) is integrated to 
obtain the real space density function w{r), which is then decomposed into short and long 
range terms for the two-level generator. The decomposition is shown in Figure 4. For = 30 
grid cells, the corrected and uncorrected short-range terms are very similar. The correction 
is only noticeable on scales very near the cutoff. 



Plotted in Figure 5(a) is the power spectrum from a sample realization of the initial matter 
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Fig. 5. Sample realization from the two-level mesh cosmological initial conditions generator. In (a), 
the realization power spectrum (solid line) of the grid-constructed matter density field is plotted 
against the template CMBFAST power spectrum (dotted line), out to the grid Nyquist frequency. 
In (b), the bias and cross-correlation between the dark matter particles and the matter density 
field are plotted out to the mean interparticle spacing. The two-level mesh generator agrees with a 
standard one-level version (dashed lines) to 1% at all scales. 

power spectrum. The dimensionless power spectrum is defined as 

A^(^) = ^^k'm, (19) 

where k = 27r/A. The reaUzation power P{k) is calculated using FFTs and averaged in bins 
that span the range k ± Ak/2 and have widths AA; = 2tx/ L. For a Gaussian random field, 
the fractional error in the realization power spectrum is given by 



AP 



p{k) \ N{ky 



(20) 



where N{k) is the number of sampling modes for wavenumber bin k. In a finite-volume 
periodic box, we have N{k) oc k'^. The grid-constructed matter density field (solid) has a 
power spectrum that statistically matches the template CMBFAST spectrum (dotted). At 
large wavenumbers or small scales, the agreement is very good due to the high number of 
sampling modes per bin. The deviations at small wavenumbers or large scales are expected 
due to sample variance. 

Figure 5(b) compares the realization grid power spectrum Pgg{k) and the realization par- 
ticles power spectrum Ppp{k), out to the mean interparticle spacing. To measure the dark 
matter power spectrum, the 256^ particles are mapped onto the 512^ grid using the CIC 
mass assignment scheme. The statistical correlation between the particles and grid can be 
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quantified with the bias 



and cross-correlation 



r{k) 



(21) 



(22) 



where Pgpik) = {5g{k)5p{k)) is the cross spectrum. The cross-correlation coefficient or 
stochasticity parameter can have values — 1 < r < 1. On large scales, the particles are 
perfectly correlated with the grid and unbiased. On small scales, finite resolution reduces the 
correlation and power. The turnover at small scales arises mainly because of the CIC mass 
assignment scheme. 

The two-level initial cosmological conditions generator (solid lines) can be directly compared 
to a standard one-level version (dashed lines) by running the latter on the same noise field 
used in the former. For both the grid-constructed matter density field and the particle density 
field, the two-level generator agrees with the standard one-level version to 1% at all scales. 



5 Cosmological Simulations 



The OCH cosmological code is applied to evolving the sample initial conditions generated in 
the previous section to demonstrate that it accurately simulates nonlinear structure forma- 
tion in the universe. The configuration where the global domain is decomposed into 2x2x2 
number of cubical blocks used in the generation of the initial conditions is retained for the 
out-of-core simulation. The dynamical evolution is checked by measuring the gas and dark 
matter mass power spectra at redshifts 2; = 0, 1, 3, and 7. Power spectra for the periodic 
density fields are computed using FFTs. For each redshift, the 8 local blocks are combined 
to construct the complete gas density field on a 512^ grid. The clustered distribution of 256^ 
dark matter particles is mapped onto a higher resolution 1024^ grid to improve the accuracy 
of the measurements. 



In Figure 6 the simu lated power spectra are plotted against the nonlinear fitting functions 
of lSmith et J1 (120031). Note that we use CMBFAST to compute t he ini tial transfer function 
while Smith et al. ( 2003[ l use that of Efstathiou. Bond, fc Whit^ ( 19921 ) and therefore, some 
differences are expected. The simulated dark matter and predicted matter spectra are in 
good agreement, except at scales near the box size and the grid spacing. At redshift 2; = 0, 
the comparison is good on frequencies k < 3h Mpc~^ or comoving wavelengths A > 2h~^ 
Mpc. More than half the power is lost on scales less than 5 grid cells because the force 
resolution in the PM scheme is softened near the grid scale. At large scales approaching the 
box size, the evolution is consistent with linear growth and the differences are due to sample 
variance in the initial conditions. The observed deviations are expected since the realization 
initial dark matter power spectrum has a deficit in power at large scales to begin with (see 
Figure 5). On linear and moderately nonlinear scales, the simulated gas and dark matter are 
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Fig. 6. Nonlinear evolution of the dark matter (long dashed lines) and gas (short dashed lines) 
power spectra at redshifts z = 0, 1, 3, and 7 from an out-of-core simulation with 512'^ grid cells and 
256^ dark matter particles in a 100 /i~^Mpc box. The simulated dark matter power s pectra agree 
with t he matter power spectra (solid lines) predicted using the fitting functions of ISmith et al 



(|2003f ). Poisson noise have been subtracted from the dark matter power spectra, except at z = 7 
where the power does not exceed the shot noise power (dotted line). 

highly correlated with no bias. On nonlinear scales, the gas loses power because the pressure 
prevents gravitational collapse and the force softening of the dark matter. The two-level 
mesh out-of - core r esults are very similar to that produced with the Hydro&N-body code of 



Trac Sz Peru ()2004 ). differing by less than 5% across the entire wavelength range and some 



deviation is expected because of the Poisson noise. 



5.1 Code comparison 



Heitmann et al.l (|2004^ compared several N-body codes and have found that, in general, the 
codes show agreement over a wide range of scales at the 5% level or bett er. The com - 



parison included the FFT-based grid code MC^ the AMR code FLASH (jFrvxell et al 



16 







MC^ 

FLASH 

GADGET 


\ \ 

\ \ - 
\ \ - 

^ \ - 


++\ \ \ 1 1 1 1 1 1 \ 1 1 1 1 1 1 


■ H \ h++ 

/ : 


~ 1 






k (h Mpc-i) 



k (h Mpc-i) 



(a) 



(b) 



Fig. 7. The bias and cross-correlation with respect to the PMM results for box sizes L of 90 Mpc 
(a) and 360 Mpc (b), respectively. The dotted line represents the particle Nyquist frequency. The 
grid codes PMM, MC^, and FLASH had comoving grid spacings of L/1024 while the tree code 
GADGET used a comoving softening length of 20 kpc. 



the adaptive P^M code Hydra (ICouchman et al.l Il995l) . and the tree codes HOT 
flWarren fc Salmonl . Il993l ). GADGET (|SDringel et all bOOlL and TPM (|Bode fc Ostrikei . 
20031 ). We have taken the pubhcly-available initial conditions for two cosmological simula- 
tions of a ACDM universe with box sizes of 90 Mpc and 360 Mpc {h = 0.71), respectively, 
and ran them with the N-body algorithm used in the OCH code. We will refer to this N-body 
code as the particle-mu lti-mesh (PMM) code and an application of the code is presented in 
McDonald etZI ^200^ . 



The simulations were run from redshift 2; = 50 down to z = with 256'^ particles on an 
effective grid of 1024^ cells. The grid cell spacing was chosen to match that used by the grid 
codes MC^ and FLASH. In Figure 7, our z = results are compared with those of MC^ and 
FLASH, as well as with the higher resolution results of GADGET. The particles are mapped 
onto a 1024^ grid using the NGP assignment scheme, Fourier transformed using FFTs, and 
power spectra are calculated while taking into account shot noise. The bias b{k) and cross- 
correlation r{k) are measured with respect to the PMM results. While the PMM, MC^, and 
FLASH simulations share the same minimum mesh spacing, the former two achieve slightly 
higher force resolution than the latter. The differences are probably due to the fact that 
FLASH only refines in relevant regions and the effective force softening on an adaptive mesh 
can be higher than that of an FFT-based Poisson solver. Both the bias and cross- correlation 
between the PMM and MC^ results are unity up to the particle Nyquist frequency and 
only deviate by less than 5% out to the grid Nyquist frequency. Note that the NGP mass 
assignment scheme and Poisson noise subtraction are partly responsible for magnifying the 
differences near the grid Nyquist frequency. The two FFT-based codes differ in their time- 
integration schemes, but we suspect they share similar gravity kernels, and the latter is a 
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stronger influence on the dynamical evolution and therefore, the origin of the high agreement 
in the results. 

The PMM simulations were run at lower resolution than the GADGET simulations and the 
effect of force softening can be seen in Figure 7. At the particle Nyquist frequency, the PMM 
results show a reduction of ~ 20% in power and the correlation is weakened by ~ 10%. By 
decreasing the ratio of the grid cell spacing to the mean interparticle spacing, one can further 
reduce the effects of force softening in a FFT-based gravity solver. However, this comes at 
a cost in both work and memory. In practice, particle-mesh type codes like PMM are more 
optimal for simulations which require high mass resolution with moderate spatial resolution. 



6 Future Work 



The Beowulf clusters are becoming the dominant HPC systems in computational astrophysics 
because of the unrivaled speeds that can easily reach into the teraflops. These systems remain 
memory-limited, but simulations can be run out-of-core. IDE hard drives are relatively cheap 
and tens to hundreds of terabytes of disk space can be added. Hardware or software raiding 
can be used to prevent the loss of data in the expected event of disk failures. For the 
distributed-memory cluster, the implementation will require a multi-level mesh scheme with 
parallelization across nodes using the Message Passing Interface (MPI) libraries. With an out- 
of-core, distributed-memory code, we will be capable of running hydrodynamic simulations 
with a trillion grid cells and particles. 



7 Conclusions 



We have developed an out-of-core hydrodynamic code and an out-of-core initial conditions 
generator for high resolution cosmological simulations that require terabytes of memory. The 
OCH code utilizes a two-level mesh gravity solver and a multi-stepping scheme that signif- 
icantly reduce the amount of disk operations. In addition, we have managed to reduce I/O 
overhead down to less than 10% by performing disk operations concurrently with numeri- 
cal calculations. The code is cost-effective and memory- efficient. It has been demonstrated 
to accurately simulate the nonlinear structure formation in the universe and will provide 
high mass resolution for cosmological applications. The first apphcation of the OCH code 
involves simulating the high redshift intergalactic medium in a WMAP cosmology (Trac & 
Pen 2004 in prep). We are currently running an out-of-core simulation with 2016^ grid cells 
and 1008"^ dark matter particles in a 50h^^ Mpc box. This high resolution simulation has 
a comoving grid spacing of Ax = 25h~^ kpc and a dark matter particle mass resolution of 
Am = 7.7 X W^h-^Mr^. 
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